OCPBUGS-25708: pkg/cvo/availableupdates: Only bump LastAttempt on Cincinnati pulls #1009
Conversation
965bfb2 (pkg/cvo/availableupdates: Requeue risk evaluation on failure, 2023-09-18, openshift#939) pivoted from "every syncAvailableUpdates round that does anything useful has a fresh Cincinnati pull" to "some syncAvailableUpdates rounds have a fresh Cincinnati pull, but others just re-evaluate some Recommended=Unknown conditional updates". Then syncAvailableUpdates calls setAvailableUpdates.

However, until this commit, setAvailableUpdates had been bumping LastAttempt every time, even in the "just re-evaluate conditional updates" case. That meant we never tripped the:

} else if !optrAvailableUpdates.RecentlyChanged(optr.minimumUpdateCheckInterval) {
	klog.V(2).Infof("Retrieving available updates again, because more than %s has elapsed since %s", optr.minimumUpdateCheckInterval, optrAvailableUpdates.LastAttempt.Format(time.RFC3339))

condition to trigger a fresh Cincinnati pull. That could lead to deadlocks like:

1. Cincinnati serves vulnerable PromQL, like [1].
2. Clusters pick up that broken PromQL, try to evaluate it, and fail. The re-eval-and-fail loop continues.
3. The Cincinnati PromQL is fixed, like [2].
4. Cases:
   a. Before 965bfb2, and also after this commit, clusters pick up the fixed PromQL, try to evaluate it, and start succeeding. Hooray!
   b. Clusters with 965bfb2 but without this commit say "it's been a long time since we pulled fresh Cincinnati information, but it has not been long since my last attempt to evaluate this broken PromQL, so let me skip the Cincinnati pull and re-evaluate that old PromQL", which fails. The re-eval-and-fail loop continues.

To break out of 4.b, clusters on impacted releases can roll their CVO pod:

$ oc -n openshift-cluster-version delete -l k8s-app=cluster-version-operator pod

which will clear out LastAttempt and trigger a fresh Cincinnati pull. I'm not sure if there's another recovery method...

[1]: openshift/cincinnati-graph-data#4524
[2]: openshift/cincinnati-graph-data#4528
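To make the intended behavior concrete, here is a minimal Go sketch of the post-fix flow. It is not the actual PR diff: availableUpdates is pared down to a single field, and fetchCincinnati / reevaluateConditionalRisks are hypothetical stand-ins. The only point it illustrates is that LastAttempt should move only when Cincinnati is actually pulled, so the RecentlyChanged guard quoted above can eventually allow a fresh pull.

```go
package main

import "time"

// availableUpdates is a pared-down stand-in for the CVO's availableUpdates
// cache; only the field relevant to this bug is shown.
type availableUpdates struct {
	LastAttempt time.Time
}

// RecentlyChanged mirrors the guard quoted above: while LastAttempt is within
// the minimum update check interval, the CVO skips a fresh Cincinnati pull.
func (u *availableUpdates) RecentlyChanged(interval time.Duration) bool {
	return u.LastAttempt.After(time.Now().Add(-interval))
}

// syncAvailableUpdates is a heavily simplified sketch; fetchCincinnati and
// reevaluateConditionalRisks are hypothetical stand-ins for the real logic.
func syncAvailableUpdates(u *availableUpdates, freshPull bool) {
	if freshPull {
		fetchCincinnati(u)
		u.LastAttempt = time.Now() // bump only when Cincinnati was actually pulled
		return
	}
	// Re-evaluating Recommended=Unknown risks must not touch LastAttempt;
	// otherwise RecentlyChanged keeps returning true and the "more than
	// minimumUpdateCheckInterval has elapsed" branch never fires (case 4.b).
	reevaluateConditionalRisks(u)
}

func fetchCincinnati(u *availableUpdates)            {}
func reevaluateConditionalRisks(u *availableUpdates) {}

func main() {}
```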
Force-pushed eaea152 to e2d6af5
@wking: This pull request references Jira Issue OCPBUGS-25708, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact:
The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cc
Ok, Cluster Bot testing with:

$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/upstream", "value": "https://raw.githubusercontent.com/wking/cincinnati-graph-data/demo/cincinnati-graph.json"}]'
$ oc adm upgrade channel demo
$ oc get -o json clusterversion version | jq '.status.conditionalUpdates[] | {conditions, risks: ([.risks[] | {name, promql: .matchingRules[0].promql.promql}])}'
{
"conditions": [
{
"lastTransitionTime": "2023-12-20T19:03:57Z",
"message": "The update is recommended, because none of the conditional update risks apply to this cluster.",
"reason": "AsExpected",
"status": "True",
"type": "Recommended"
}
],
"risks": [
{
"name": "A",
"promql": "cluster_operator_conditions"
},
{
"name": "B",
"promql": "group(cluster_version_available_updates{channel=\"buggy\"})\nor\n0 * group(cluster_version_available_updates{channel!=\"buggy\"})"
},
{
"name": "C",
"promql": "group(csv_succeeded{name=~\"local-storage-operator[.].*\"}) or 0 * group(csv_count)"
},
{
"name": "D",
"promql": "0 * max(cluster_version)"
},
{
"name": "E",
"promql": "0 * 0 * max(cluster_version)"
}
]
}

I dunno how the …
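For orientation, here is a rough Go mirror of the status.conditionalUpdates shape that the jq filter above is walking. These are illustrative types based only on the JSON output shown here, not the CVO's actual API structs (those live in openshift/api):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Illustrative types mirroring the JSON fields shown above; the real
// ClusterVersion API types differ in detail.
type ConditionalUpdate struct {
	Conditions []Condition `json:"conditions"`
	Risks      []Risk      `json:"risks"`
}

type Condition struct {
	LastTransitionTime string `json:"lastTransitionTime"`
	Message            string `json:"message"`
	Reason             string `json:"reason"`
	Status             string `json:"status"`
	Type               string `json:"type"`
}

type Risk struct {
	Name          string         `json:"name"`
	MatchingRules []MatchingRule `json:"matchingRules"`
}

type MatchingRule struct {
	PromQL PromQLQuery `json:"promql"`
}

type PromQLQuery struct {
	PromQL string `json:"promql"`
}

func main() {
	// Risk D from the output above, round-tripped to show where the jq path
	// .matchingRules[0].promql.promql points.
	u := ConditionalUpdate{
		Risks: []Risk{{
			Name:          "D",
			MatchingRules: []MatchingRule{{PromQL: PromQLQuery{PromQL: "0 * max(cluster_version)"}}},
		}},
	}
	out, _ := json.MarshalIndent(u, "", "  ")
	fmt.Println(string(out))
}
```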
Ok, new Cluster Bot round with:

$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/upstream", "value": "https://raw.githubusercontent.com/wking/cincinnati-graph-data/demo/cincinnati-graph.json"}]'
$ oc adm upgrade channel demo
$ oc adm upgrade --include-not-recommended
Cluster version is 4.16.0-0.test-2023-12-20-194115-ci-ln-hyd2jpt-latest
Upstream: https://raw.githubusercontent.com/wking/cincinnati-graph-data/demo/cincinnati-graph.json
Channel: demo
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.
Supported but not recommended updates:
Version: 4.16.1
Image: quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000
Recommended: Unknown
Reason: EvaluationFailed
Message: Could not evaluate exposure to update risk A (invalid PromQL result length must be one, but is 147)
A description: A.
A URL: https://bug.example.com/a

So hooray, we're sticking on multiple matches now. Not sure why we didn't last time... Continuing with the OCPBUGS-25708 reproducer procedure:

$ sed -i 's/cluster_operator_conditions/group(cluster_operator_conditions)/' cincinnati-graph.json
$ git commit -am 'Fix broken PromQL'
$ git push wking demo
$ sleep 600
$ oc adm upgrade --include-not-recommended
Cluster version is 4.16.0-0.test-2023-12-20-194115-ci-ln-hyd2jpt-latest
Upstream: https://raw.githubusercontent.com/wking/cincinnati-graph-data/demo/cincinnati-graph.json
Channel: demo
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.
Supported but not recommended updates:
Version: 4.16.1
Image: quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000
Recommended: False
Reason: A
Message: A. https://bug.example.com/a

So it noticed the PromQL expression fix, and went from Recommended=Unknown to Recommended=False.
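That flip from Recommended=Unknown to Recommended=False also shows why the group() wrapper matters: a risk's PromQL is expected to resolve to a single sample, and bare cluster_operator_conditions matches one series per operator condition (147 here), while group(...) collapses the match to one series. A hypothetical sketch of that single-sample check follows; sample and riskApplies are made-up names, not CVO code:

```go
package main

import "fmt"

// sample stands in for one PromQL result series/value pair.
type sample struct {
	value float64
}

// riskApplies is a made-up helper: a conditional-update risk's PromQL must
// resolve to exactly one sample, otherwise evaluation fails with the same
// flavor of error seen in the oc adm upgrade output above.
func riskApplies(promql string, result []sample) (bool, error) {
	if len(result) != 1 {
		return false, fmt.Errorf("evaluating %q: invalid PromQL result length must be one, but is %d", promql, len(result))
	}
	return result[0].value > 0, nil
}

func main() {
	// cluster_operator_conditions returns one series per operator condition,
	// so evaluation fails with the length error.
	fmt.Println(riskApplies("cluster_operator_conditions", make([]sample, 147)))

	// group(cluster_operator_conditions) aggregates to a single series, so
	// the risk can be evaluated (non-zero value means the risk applies).
	fmt.Println(riskApplies("group(cluster_operator_conditions)", []sample{{value: 1}}))
}
```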
/assign @LalatenduMohanty

@wking Not sure what you meant here. Can you please explain?
/lgtm |
In my first verification attempt, when I fed the cluster … In my second verification attempt, I mixed in #1010 for debugging, but didn't end up needing it, because we got …
petr-muller left a comment
Looks good! Pretty nasty failure case - it's quite easy in hindsight, but hard to figure out before you know it's there ;)
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LalatenduMohanty, petr-muller, wking

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/label qe-approved
@wking: This pull request references Jira Issue OCPBUGS-25708, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/jira refresh

The requirements for Jira bugs have changed (Jira issues linked to PRs on main branch need to target different OCP), recalculating validity.
@openshift-bot: This pull request references Jira Issue OCPBUGS-25708, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
cluster-baremetal-operator crash-looping is getting addressed in openshift/cluster-baremetal-operator#395. I dunno about the …

/override ci/prow/e2e-agnostic-ovn-upgrade-out-of-change
@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-ovn-upgrade-out-of-change
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@wking: all tests passed! Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@wking: Jira Issue OCPBUGS-25708: All pull requests linked via external trackers have merged:
Jira Issue OCPBUGS-25708 has been moved to the MODIFIED state.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cherrypick release-4.15
@wking: new pull request created: #1013
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[ART PR BUILD NOTIFIER] This PR has been included in build cluster-version-operator-container-v4.16.0-202401022050.p0.g5d3c08b.assembly.stream for distgit cluster-version-operator. |